Question 1

Repeat the analysis shown in the R lab of this chapter, but use TOTAL13 as the outcome variable. Please build both the regression model and the decision tree model (for regression). Identify the final models you would select, evaluate the models, and compare the regression model with the tree model.

library(RCurl)
## Loading required package: bitops
AD <- read.csv(text=getURL("https://raw.githubusercontent.com/shuailab/ind_498/master/resource/data/AD2.csv"))
AD$ID = c(1:dim(AD)[1])
str(AD)
## 'data.frame':    517 obs. of  18 variables:
##  $ AGE       : num  71.7 77.7 72.8 69.6 70.9 65.1 79.6 73.6 60.7 70.6 ...
##  $ PTGENDER  : int  2 1 2 1 1 2 2 2 1 2 ...
##  $ PTEDUCAT  : int  14 18 18 13 13 20 20 18 19 18 ...
##  $ FDG       : num  6.82 6.37 6.37 6.37 6.37 ...
##  $ AV45      : num  1.11 1.11 1.11 1.11 1.11 ...
##  $ HippoNV   : num  0.529 0.538 0.269 0.576 0.601 ...
##  $ e2_1      : int  1 0 0 0 1 0 0 0 0 1 ...
##  $ e4_1      : int  0 0 1 0 0 1 1 1 1 1 ...
##  $ rs3818361 : int  1 1 1 1 1 1 1 1 0 0 ...
##  $ rs744373  : int  1 0 1 1 1 0 1 1 0 1 ...
##  $ rs11136000: int  1 1 1 1 1 0 0 1 0 0 ...
##  $ rs610932  : int  1 1 0 1 0 1 1 1 0 1 ...
##  $ rs3851179 : int  1 0 1 0 0 1 0 0 1 0 ...
##  $ rs3764650 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ rs3865444 : int  0 1 1 0 0 0 1 1 1 0 ...
##  $ MMSCORE   : int  26 30 30 28 29 30 30 27 28 30 ...
##  $ TOTAL13   : num  8 1.67 12 3 10 3.67 4 11 3 9 ...
##  $ ID        : int  1 2 3 4 5 6 7 8 9 10 ...
AD_demo <- subset(AD, select=c("TOTAL13", "AGE","PTGENDER","PTEDUCAT","ID"))
str(AD_demo)
## 'data.frame':    517 obs. of  5 variables:
##  $ TOTAL13 : num  8 1.67 12 3 10 3.67 4 11 3 9 ...
##  $ AGE     : num  71.7 77.7 72.8 69.6 70.9 65.1 79.6 73.6 60.7 70.6 ...
##  $ PTGENDER: int  2 1 2 1 1 2 2 2 1 2 ...
##  $ PTEDUCAT: int  14 18 18 13 13 20 20 18 19 18 ...
##  $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
library(ggplot2)
p <- ggplot(AD_demo, aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=4)
p <- p + labs(title="TOTAL13 versus PTEDUCAT")
print(p)

The scatter plot “TOTAL13 versus PTEDUCAT” shows only a weak relationship between PTEDUCAT and TOTAL13.

library(ggplot2)
p <- ggplot(AD_demo, aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=4)
p <- p + labs(title="TOTAL13 versus AGE")
print(p)

The scatter plot “TOTAL13 versus AGE” shows a weak positive relationship between AGE and TOTAL13.

# fit a simple linear regression model with AGE
library(car)
lm.AD_demo <- lm(TOTAL13 ~ AGE, data = AD_demo)
summary(lm.AD_demo)
## 
## Call:
## lm(formula = TOTAL13 ~ AGE, data = AD_demo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.777  -5.598  -1.834   4.063  39.299 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2456     3.5461   0.069 0.944819    
## AGE           0.1887     0.0486   3.884 0.000116 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.913 on 515 degrees of freedom
## Multiple R-squared:  0.02845,    Adjusted R-squared:  0.02657 
## F-statistic: 15.08 on 1 and 515 DF,  p-value: 0.0001164

‘AGE’ is statistically significant with a p-value of 0.000116, so we reject the null hypothesis (H0: no relationship between TOTAL13 and AGE). However, R-squared is only 0.02845, indicating that AGE explains only about 2.8% of the variability in TOTAL13.
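As a sanity check, in simple linear regression the R-squared equals the squared sample correlation between the predictor and the outcome, so the 0.02845 above could also be obtained as cor(AD_demo$AGE, AD_demo$TOTAL13)^2. A minimal sketch with simulated data (the variable names mirror AD_demo, but the numbers are made up):

```r
# Simulated stand-in for AD_demo (hypothetical numbers, not the real data)
set.seed(1)
demo <- data.frame(AGE = rnorm(100, mean = 73, sd = 7))
demo$TOTAL13 <- 0.2 * demo$AGE + rnorm(100, sd = 8)

fit <- lm(TOTAL13 ~ AGE, data = demo)
# With a single predictor, the model R-squared equals the squared correlation
summary(fit)$r.squared
cor(demo$AGE, demo$TOTAL13)^2  # same value
```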

lm.AD_demo2 <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT, data = AD_demo)
summary(lm.AD_demo2)
## 
## Call:
## lm(formula = TOTAL13 ~ AGE + PTGENDER + PTEDUCAT, data = AD_demo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.691  -5.450  -1.972   4.124  38.582 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  4.52738    4.21805   1.073  0.28363   
## AGE          0.16025    0.04867   3.292  0.00106 **
## PTGENDER     2.18848    0.71130   3.077  0.00220 **
## PTEDUCAT    -0.34519    0.13026  -2.650  0.00830 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.825 on 513 degrees of freedom
## Multiple R-squared:  0.05359,    Adjusted R-squared:  0.04806 
## F-statistic: 9.684 on 3 and 513 DF,  p-value: 3.164e-06

With all three demographic variables included in the model, the R-squared value increases to 0.05359, indicating that about 5.4% of the variability in TOTAL13 can be explained by the three variables. All three variables are statistically significant, with p-values of 0.00106, 0.00220, and 0.00830, all less than 0.01.

require(ggplot2)
p <- ggplot(AD_demo, aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(aes(colour=AGE), size=2)
p <- p + labs(title="TOTAL13 versus PTEDUCAT")
print(p)

Because the relationship between TOTAL13 and PTEDUCAT may change with the level of AGE, we can examine the same scatter plot at two levels of AGE.

p <- ggplot(AD_demo[which(AD_demo$AGE < 60),], aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + geom_smooth(method = lm)
p <- p + labs(title="TOTAL13 versus PTEDUCAT when AGE < 60")
print(p)

p <- ggplot(AD_demo[which(AD_demo$AGE > 80),], aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + geom_smooth(method = lm)
p <- p + labs(title="TOTAL13 versus PTEDUCAT when AGE > 80")
print(p)

TOTAL13 shows a very weak positive correlation with PTEDUCAT when AGE < 60. On the other hand, TOTAL13 is negatively correlated with PTEDUCAT for those older than 80.

p <- ggplot(AD_demo, aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + labs(title="TOTAL13 versus Age")
print(p)

#for male
p <- ggplot(AD_demo[which(AD_demo$PTGENDER == 1),], aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + labs(title="TOTAL13 versus Age - male")
print(p)

# for female
p <- ggplot(AD_demo[which(AD_demo$PTGENDER == 2),], aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + labs(title="TOTAL13 versus Age - female")
print(p)

The scatter plot of TOTAL13 versus AGE for males shows a more spread-out pattern than the one for females.

# including the interaction term AGE*PTEDUCAT
lm.AD_demo2 <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + AGE*PTEDUCAT, data = AD_demo)
summary(lm.AD_demo2)
## 
## Call:
## lm(formula = TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + AGE * PTEDUCAT, 
##     data = AD_demo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.259  -5.233  -1.888   3.890  38.180 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -19.61428   20.96925  -0.935  0.35003   
## AGE            0.48724    0.28244   1.725  0.08511 . 
## PTGENDER       2.25736    0.71344   3.164  0.00165 **
## PTEDUCAT       1.15343    1.28174   0.900  0.36860   
## AGE:PTEDUCAT  -0.02042    0.01737  -1.175  0.24042   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.822 on 512 degrees of freedom
## Multiple R-squared:  0.05614,    Adjusted R-squared:  0.04877 
## F-statistic: 7.613 on 4 and 512 DF,  p-value: 5.795e-06

Including the interaction term increases the R-squared value from 0.05359 (the model without the interaction term) to 0.05614. However, the interaction term is not statistically significant: its p-value of 0.24 is greater than 0.05, so we cannot reject the null hypothesis that its coefficient is zero.
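The significance of the interaction term can equivalently be checked with a partial F-test between the nested models via anova(); for a single added term, the F-test p-value matches the t-test p-value above. A sketch reusing the objects defined earlier:

```r
# Nested-model comparison: additive model vs. model with the interaction
lm.additive <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT, data = AD_demo)
lm.interact <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + AGE:PTEDUCAT, data = AD_demo)
anova(lm.additive, lm.interact)  # Pr(>F) should match the 0.24 reported above
```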

# Diagnostics graphs:
require("ggfortify")
## Loading required package: ggfortify
## Warning: namespace 'DBI' is not available and has been replaced
## by .GlobalEnv when processing object 'quiet'

## Warning: namespace 'DBI' is not available and has been replaced
## by .GlobalEnv when processing object 'quiet'
autoplot(lm.AD_demo2, which = 1:6, ncol = 3, label.size = 3)

The residuals-vs-fitted plot is used to detect non-linearity, unequal error variances, and outliers. Our plot shows no strong patterns, indicating the model fits reasonably well; the residuals are clustered around the fitted line. The Scale-Location and standardized residual plots show no significant patterns either.

The normal Q-Q plot shows whether the data came from a given theoretical distribution (e.g., normal or exponential). Our plot is not perfectly straight, suggesting there is room to improve the model.

We then added more predictors to improve the model:

# try full-scale model - exclude MMSCORE as it is another outcome variable
AD_full <- AD[,c(1:17)]
AD_full <- subset(AD_full, select = -c(MMSCORE) )
names(AD_full)
##  [1] "AGE"        "PTGENDER"   "PTEDUCAT"   "FDG"        "AV45"      
##  [6] "HippoNV"    "e2_1"       "e4_1"       "rs3818361"  "rs744373"  
## [11] "rs11136000" "rs610932"   "rs3851179"  "rs3764650"  "rs3865444" 
## [16] "TOTAL13"
lm.AD <- lm(TOTAL13 ~ ., data = AD_full)
summary(lm.AD)
## 
## Call:
## lm(formula = TOTAL13 ~ ., data = AD_full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4440  -4.1462  -0.4275   3.7954  24.8925 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.83420    5.91765   9.097  < 2e-16 ***
## AGE          -0.03661    0.04175  -0.877 0.380900    
## PTGENDER      0.76054    0.57025   1.334 0.182908    
## PTEDUCAT     -0.34366    0.10275  -3.345 0.000886 ***
## FDG          -3.89724    0.46608  -8.362 6.15e-16 ***
## AV45          7.06724    1.49597   4.724 3.01e-06 ***
## HippoNV     -36.24541    4.20495  -8.620  < 2e-16 ***
## e2_1         -0.33649    0.95733  -0.351 0.725373    
## e4_1         -0.04923    0.61540  -0.080 0.936268    
## rs3818361    -0.61150    0.57083  -1.071 0.284570    
## rs744373     -0.18606    0.54306  -0.343 0.732032    
## rs11136000   -0.32233    0.57256  -0.563 0.573713    
## rs610932      1.13649    0.56506   2.011 0.044832 *  
## rs3851179    -0.33988    0.54648  -0.622 0.534264    
## rs3764650     0.65356    0.68652   0.952 0.341558    
## rs3865444     0.93384    0.53949   1.731 0.084072 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.079 on 501 degrees of freedom
## Multiple R-squared:  0.4423, Adjusted R-squared:  0.4256 
## F-statistic: 26.48 on 15 and 501 DF,  p-value: < 2.2e-16

To predict TOTAL13, a full model was built with all the demographic, genetic, and imaging variables. PTEDUCAT, FDG, AV45, HippoNV, and rs610932 are significant based on their p-values. The R-squared increases to 0.4423, indicating that about 44% of the variability in TOTAL13 can be explained by these variables.

# try taking AGE out to find differences
lm.AD.reduced <- lm.AD;
lm.AD.reduced <- update(lm.AD.reduced, ~ . - AGE); 
summary(lm.AD.reduced);
## 
## Call:
## lm(formula = TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + 
##     rs3851179 + rs3764650 + rs3865444, data = AD_full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9824  -4.1548  -0.4448   3.7769  25.3646 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.57095    4.60043  10.993  < 2e-16 ***
## PTGENDER      0.73102    0.56912   1.284  0.19957    
## PTEDUCAT     -0.33283    0.10198  -3.264  0.00117 ** 
## FDG          -3.90679    0.46584  -8.387 5.10e-16 ***
## AV45          6.94916    1.48956   4.665 3.96e-06 ***
## HippoNV     -34.92868    3.92688  -8.895  < 2e-16 ***
## e2_1         -0.32283    0.95699  -0.337  0.73600    
## e4_1          0.06169    0.60213   0.102  0.91844    
## rs3818361    -0.59838    0.57050  -1.049  0.29474    
## rs744373     -0.18304    0.54292  -0.337  0.73616    
## rs11136000   -0.33890    0.57211  -0.592  0.55387    
## rs610932      1.15310    0.56461   2.042  0.04165 *  
## rs3851179    -0.33920    0.54636  -0.621  0.53498    
## rs3764650     0.64759    0.68633   0.944  0.34585    
## rs3865444     0.93810    0.53934   1.739  0.08259 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.078 on 502 degrees of freedom
## Multiple R-squared:  0.4414, Adjusted R-squared:  0.4258 
## F-statistic: 28.33 on 14 and 502 DF,  p-value: < 2.2e-16

The R-squared value is slightly reduced from 0.4423 to 0.4414, an almost negligible change. We can use an F-test to compare the full model with this reduced model by applying the anova() function.

anova(lm.AD.reduced,lm.AD)
## Analysis of Variance Table
## 
## Model 1: TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e2_1 + 
##     e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 + 
##     rs3764650 + rs3865444
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + 
##     rs3851179 + rs3764650 + rs3865444
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    502 18542                           
## 2    501 18514  1    28.422 0.7692 0.3809

By the F-test, the p-value is 0.3809, which indicates that the two models are statistically indistinguishable. We then removed the least significant remaining predictor, e4_1.

lm.AD.reduced <- update(lm.AD.reduced, ~ . - e4_1); 
summary(lm.AD.reduced);
## 
## Call:
## lm(formula = TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     e2_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 + 
##     rs3764650 + rs3865444, data = AD_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.945  -4.170  -0.439   3.767  25.314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.5787     4.5953  11.007  < 2e-16 ***
## PTGENDER      0.7292     0.5683   1.283  0.20001    
## PTEDUCAT     -0.3324     0.1018  -3.265  0.00117 ** 
## FDG          -3.9138     0.4603  -8.503  < 2e-16 ***
## AV45          6.9930     1.4254   4.906 1.26e-06 ***
## HippoNV     -34.9195     3.9220  -8.904  < 2e-16 ***
## e2_1         -0.3387     0.9434  -0.359  0.71970    
## rs3818361    -0.5944     0.5686  -1.045  0.29636    
## rs744373     -0.1811     0.5421  -0.334  0.73841    
## rs11136000   -0.3375     0.5714  -0.591  0.55496    
## rs610932      1.1536     0.5640   2.045  0.04135 *  
## rs3851179    -0.3403     0.5457  -0.624  0.53314    
## rs3764650     0.6482     0.6856   0.945  0.34491    
## rs3865444     0.9398     0.5386   1.745  0.08159 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.072 on 503 degrees of freedom
## Multiple R-squared:  0.4414, Adjusted R-squared:  0.4269 
## F-statistic: 30.57 on 13 and 503 DF,  p-value: < 2.2e-16
anova(lm.AD.reduced,lm.AD)
## Analysis of Variance Table
## 
## Model 1: TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e2_1 + 
##     rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 + 
##     rs3764650 + rs3865444
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + 
##     rs3851179 + rs3764650 + rs3865444
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    503 18542                           
## 2    501 18514  2     28.81 0.3898 0.6774

We can repeat this process until no more variables can be deleted, or use the step() function to automate it.

# model selection
lm.AD.F <- step(lm.AD, direction="backward", test="F")
## Start:  AIC=1881.94
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + 
##     rs3851179 + rs3764650 + rs3865444
## 
##              Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - e4_1        1      0.24 18514 1879.9  0.0064 0.9362678    
## - rs744373    1      4.34 18518 1880.1  0.1174 0.7320322    
## - e2_1        1      4.57 18518 1880.1  0.1235 0.7253731    
## - rs11136000  1     11.71 18525 1880.3  0.3169 0.5737126    
## - rs3851179   1     14.29 18528 1880.3  0.3868 0.5342644    
## - AGE         1     28.42 18542 1880.7  0.7692 0.3809001    
## - rs3764650   1     33.49 18547 1880.9  0.9063 0.3415580    
## - rs3818361   1     42.41 18556 1881.1  1.1476 0.2845704    
## - PTGENDER    1     65.73 18579 1881.8  1.7788 0.1829079    
## <none>                    18514 1881.9                      
## - rs3865444   1    110.72 18624 1883.0  2.9963 0.0840724 .  
## - rs610932    1    149.48 18663 1884.1  4.0452 0.0448324 *  
## - PTEDUCAT    1    413.38 18927 1891.3 11.1867 0.0008856 ***
## - AV45        1    824.71 19338 1902.5 22.3178 3.005e-06 ***
## - FDG         1   2583.75 21097 1947.5 69.9200 6.153e-16 ***
## - HippoNV     1   2745.58 21259 1951.4 74.2992 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1879.94
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     e2_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 + 
##     rs3764650 + rs3865444
## 
##              Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - e2_1        1      4.35 18518 1878.1  0.1179 0.7314465    
## - rs744373    1      4.41 18518 1878.1  0.1195 0.7296991    
## - rs11136000  1     11.82 18526 1878.3  0.3205 0.5715653    
## - rs3851179   1     14.23 18528 1878.3  0.3857 0.5348293    
## - AGE         1     28.57 18542 1878.7  0.7748 0.3791652    
## - rs3764650   1     33.44 18547 1878.9  0.9066 0.3414752    
## - rs3818361   1     42.95 18557 1879.1  1.1647 0.2810068    
## - PTGENDER    1     65.89 18580 1879.8  1.7866 0.1819460    
## <none>                    18514 1879.9                      
## - rs3865444   1    110.52 18624 1881.0  2.9968 0.0840459 .  
## - rs610932    1    149.46 18663 1882.1  4.0526 0.0446365 *  
## - PTEDUCAT    1    413.67 18927 1889.4 11.2166 0.0008716 ***
## - AV45        1    896.25 19410 1902.4 24.3019 1.121e-06 ***
## - FDG         1   2628.05 21142 1946.6 71.2598 3.378e-16 ***
## - HippoNV     1   2750.50 21264 1949.5 74.5799 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1878.06
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 + 
##     rs3764650 + rs3865444
## 
##              Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - rs744373    1      4.12 18522 1876.2  0.1120 0.7379869    
## - rs11136000  1     11.90 18530 1876.4  0.3232 0.5699302    
## - rs3851179   1     12.73 18531 1876.4  0.3457 0.5567995    
## - AGE         1     28.98 18547 1876.9  0.7871 0.3754009    
## - rs3764650   1     33.82 18552 1877.0  0.9187 0.3382775    
## - rs3818361   1     41.72 18560 1877.2  1.1332 0.2876095    
## - PTGENDER    1     65.01 18583 1877.9  1.7658 0.1845085    
## <none>                    18518 1878.1                      
## - rs3865444   1    110.49 18628 1879.1  3.0012 0.0838175 .  
## - rs610932    1    150.64 18669 1880.2  4.0918 0.0436207 *  
## - PTEDUCAT    1    412.99 18931 1887.5 11.2179 0.0008709 ***
## - AV45        1    917.13 19435 1901.0 24.9117 8.285e-07 ***
## - FDG         1   2626.99 21145 1944.7 71.3560 3.223e-16 ***
## - HippoNV     1   2767.45 21286 1948.1 75.1714 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1876.18
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     rs3818361 + rs11136000 + rs610932 + rs3851179 + rs3764650 + 
##     rs3865444
## 
##              Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - rs3851179   1     11.94 18534 1874.5  0.3250 0.5688721    
## - rs11136000  1     11.98 18534 1874.5  0.3260 0.5682597    
## - AGE         1     28.68 18551 1875.0  0.7804 0.3774352    
## - rs3764650   1     32.94 18555 1875.1  0.8964 0.3442065    
## - rs3818361   1     42.02 18564 1875.3  1.1434 0.2854451    
## - PTGENDER    1     65.45 18588 1876.0  1.7810 0.1826239    
## <none>                    18522 1876.2                      
## - rs3865444   1    112.37 18634 1877.3  3.0576 0.0809696 .  
## - rs610932    1    149.72 18672 1878.3  4.0741 0.0440754 *  
## - PTEDUCAT    1    408.92 18931 1885.5 11.1269 0.0009136 ***
## - AV45        1    922.51 19445 1899.3 25.1021 7.537e-07 ***
## - FDG         1   2627.13 21149 1942.8 71.4857 3.029e-16 ***
## - HippoNV     1   2765.66 21288 1946.1 75.2552 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1874.51
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     rs3818361 + rs11136000 + rs610932 + rs3764650 + rs3865444
## 
##              Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - rs11136000  1     10.75 18545 1872.8  0.2928 0.5886675    
## - AGE         1     28.74 18563 1873.3  0.7831 0.3766019    
## - rs3764650   1     31.87 18566 1873.4  0.8684 0.3518371    
## - rs3818361   1     38.98 18573 1873.6  1.0621 0.3032322    
## - PTGENDER    1     65.48 18600 1874.3  1.7842 0.1822318    
## <none>                    18534 1874.5                      
## - rs3865444   1    111.00 18645 1875.6  3.0245 0.0826260 .  
## - rs610932    1    153.95 18688 1876.8  4.1947 0.0410676 *  
## - PTEDUCAT    1    406.39 18940 1883.7 11.0728 0.0009399 ***
## - AV45        1    936.07 19470 1898.0 25.5052 6.173e-07 ***
## - FDG         1   2616.06 21150 1940.8 71.2800 3.305e-16 ***
## - HippoNV     1   2782.84 21317 1944.8 75.8241 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1872.81
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     rs3818361 + rs610932 + rs3764650 + rs3865444
## 
##             Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - AGE        1     29.74 18575 1871.6  0.8115  0.368107    
## - rs3764650  1     32.16 18577 1871.7  0.8775  0.349347    
## - rs3818361  1     34.99 18580 1871.8  0.9548  0.328962    
## - PTGENDER   1     64.88 18610 1872.6  1.7703  0.183940    
## <none>                   18545 1872.8                      
## - rs3865444  1    109.13 18654 1873.8  2.9777  0.085032 .  
## - rs610932   1    154.78 18700 1875.1  4.2233  0.040385 *  
## - PTEDUCAT   1    403.69 18949 1881.9 11.0147  0.000969 ***
## - AV45       1    971.95 19517 1897.2 26.5197 3.743e-07 ***
## - FDG        1   2608.87 21154 1938.9 71.1834 3.435e-16 ***
## - HippoNV    1   2787.67 21332 1943.2 76.0621 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1871.64
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs3818361 + 
##     rs610932 + rs3764650 + rs3865444
## 
##             Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - rs3764650  1     31.73 18606 1870.5  0.8661  0.352482    
## - rs3818361  1     32.25 18607 1870.5  0.8804  0.348546    
## - PTGENDER   1     59.54 18634 1871.3  1.6252  0.202946    
## <none>                   18575 1871.6                      
## - rs3865444  1    110.73 18685 1872.7  3.0224  0.082730 .  
## - rs610932   1    159.88 18734 1874.1  4.3639  0.037207 *  
## - PTEDUCAT   1    382.68 18957 1880.2 10.4455  0.001309 ** 
## - AV45       1    963.61 19538 1895.8 26.3020 4.163e-07 ***
## - FDG        1   2645.81 21220 1938.5 72.2183 < 2.2e-16 ***
## - HippoNV    1   2957.93 21532 1946.0 80.7377 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1870.52
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs3818361 + 
##     rs610932 + rs3865444
## 
##             Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - rs3818361  1     34.38 18641 1869.5  0.9387   0.33306    
## - PTGENDER   1     60.70 18667 1870.2  1.6573   0.19855    
## <none>                   18606 1870.5                      
## - rs3865444  1    112.69 18719 1871.6  3.0766   0.08003 .  
## - rs610932   1    170.43 18777 1873.2  4.6532   0.03146 *  
## - PTEDUCAT   1    373.19 18980 1878.8 10.1890   0.00150 ** 
## - AV45       1    961.94 19568 1894.6 26.2633 4.240e-07 ***
## - FDG        1   2636.85 21243 1937.0 71.9926 2.378e-16 ***
## - HippoNV    1   2947.01 21553 1944.5 80.4608 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1869.48
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + 
##     rs3865444
## 
##             Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## - PTGENDER   1     59.51 18700 1869.1  1.6251  0.202963    
## <none>                   18641 1869.5                      
## - rs3865444  1    118.62 18759 1870.8  3.2391  0.072494 .  
## - rs610932   1    165.19 18806 1872.0  4.5106  0.034168 *  
## - PTEDUCAT   1    361.40 19002 1877.4  9.8682  0.001779 ** 
## - AV45       1    951.70 19592 1893.2 25.9870 4.855e-07 ***
## - FDG        1   2661.14 21302 1936.5 72.6645 < 2.2e-16 ***
## - HippoNV    1   2926.99 21568 1942.9 79.9238 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=1869.12
## TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + rs3865444
## 
##             Df Sum of Sq   RSS    AIC F value    Pr(>F)    
## <none>                   18700 1869.1                      
## - rs3865444  1     132.2 18832 1870.8  3.6041  0.058203 .  
## - rs610932   1     154.5 18855 1871.4  4.2137  0.040610 *  
## - PTEDUCAT   1     317.9 19018 1875.8  8.6699  0.003383 ** 
## - AV45       1     907.3 19608 1891.6 24.7451 8.958e-07 ***
## - FDG        1    2768.2 21468 1938.5 75.4946 < 2.2e-16 ***
## - HippoNV    1    3216.4 21917 1949.2 87.7196 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.AD.F)
## 
## Call:
## lm(formula = TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + 
##     rs3865444, data = AD_full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7759  -4.3069  -0.4518   3.9067  25.5206 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51.19143    4.23927  12.076  < 2e-16 ***
## PTEDUCAT     -0.29086    0.09878  -2.944  0.00338 ** 
## FDG          -3.95981    0.45574  -8.689  < 2e-16 ***
## AV45          6.92069    1.39125   4.974 8.96e-07 ***
## HippoNV     -35.82765    3.82534  -9.366  < 2e-16 ***
## rs610932      1.14763    0.55908   2.053  0.04061 *  
## rs3865444     1.01433    0.53430   1.898  0.05820 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.055 on 510 degrees of freedom
## Multiple R-squared:  0.4366, Adjusted R-squared:   0.43 
## F-statistic: 65.88 on 6 and 510 DF,  p-value: < 2.2e-16
anova(lm.AD.F ,lm.AD)
## Analysis of Variance Table
## 
## Model 1: TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + rs3865444
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + 
##     e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + 
##     rs3851179 + rs3764650 + rs3865444
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    510 18700                           
## 2    501 18514  9    186.77 0.5616 0.8287

Using the final 6 predictors, the R-squared value is 0.4366, which is not far off from the 0.4423 obtained by the model using all predictors. Applying the F-test with the anova() function, the two models are statistically indistinguishable, with a p-value of 0.8287.
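Since step() selects models by AIC, the two candidates can also be compared directly on that criterion; a short sketch using the objects above (note that AIC() and the values printed in the step() trace differ by a constant, but the ordering is the same):

```r
# Lower AIC is better; lm.AD.F should come out ahead despite its slightly lower R-squared
AIC(lm.AD, lm.AD.F)
```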

# Diagnostics graphs:
library("ggfortify")
autoplot(lm.AD.F, which = 1:6, ncol = 3, label.size = 3)

The residuals-vs-fitted and scale-location plots still show no significant patterns, indicating that the linearity assumption is not badly violated. The normal Q-Q plot for the new model improves substantially over the previous model with only demographic predictors.

FINAL MODEL

Therefore, we selected the model with the 6 predictors (PTEDUCAT, FDG, AV45, HippoNV, rs610932, rs3865444) as the final model.
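To evaluate the final model's predictive performance (rather than its in-sample fit), one option is a random training/testing split and out-of-sample RMSE; a sketch, assuming AD_full is still in the workspace (the 75/25 split ratio and the seed are arbitrary choices):

```r
set.seed(42)  # arbitrary seed for reproducibility
train.idx <- sample(nrow(AD_full), floor(0.75 * nrow(AD_full)))
AD.train <- AD_full[train.idx, ]
AD.test  <- AD_full[-train.idx, ]

# Refit the selected 6-predictor model on the training set only
lm.final <- lm(TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + rs3865444,
               data = AD.train)
pred <- predict(lm.final, newdata = AD.test)
sqrt(mean((AD.test$TOTAL13 - pred)^2))  # out-of-sample RMSE
```

The same split can later be reused to score the decision tree model, so the two models are compared on identical test data.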

# Evaluate the variable importance by all subsets regression
# install.packages("leaps")
library(leaps)
leaps<-regsubsets(TOTAL13 ~ ., data = AD_full,nbest=4)
# view results 
summary(leaps)
## Subset selection object
## Call: regsubsets.formula(TOTAL13 ~ ., data = AD_full, nbest = 4)
## 15 Variables  (and intercept)
##            Forced in Forced out
## AGE            FALSE      FALSE
## PTGENDER       FALSE      FALSE
## PTEDUCAT       FALSE      FALSE
## FDG            FALSE      FALSE
## AV45           FALSE      FALSE
## HippoNV        FALSE      FALSE
## e2_1           FALSE      FALSE
## e4_1           FALSE      FALSE
## rs3818361      FALSE      FALSE
## rs744373       FALSE      FALSE
## rs11136000     FALSE      FALSE
## rs610932       FALSE      FALSE
## rs3851179      FALSE      FALSE
## rs3764650      FALSE      FALSE
## rs3865444      FALSE      FALSE
## 4 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AGE PTGENDER PTEDUCAT FDG AV45 HippoNV e2_1 e4_1 rs3818361
## 1  ( 1 ) " " " "      " "      "*" " "  " "     " "  " "  " "      
## 1  ( 2 ) " " " "      " "      " " " "  "*"     " "  " "  " "      
## 1  ( 3 ) " " " "      " "      " " "*"  " "     " "  " "  " "      
## 1  ( 4 ) " " " "      " "      " " " "  " "     " "  "*"  " "      
## 2  ( 1 ) " " " "      " "      "*" " "  "*"     " "  " "  " "      
## 2  ( 2 ) " " " "      " "      " " "*"  "*"     " "  " "  " "      
## 2  ( 3 ) " " " "      " "      "*" "*"  " "     " "  " "  " "      
## 2  ( 4 ) "*" " "      " "      "*" " "  " "     " "  " "  " "      
## 3  ( 1 ) " " " "      " "      "*" "*"  "*"     " "  " "  " "      
## 3  ( 2 ) " " " "      "*"      "*" " "  "*"     " "  " "  " "      
## 3  ( 3 ) " " " "      " "      "*" " "  "*"     " "  " "  " "      
## 3  ( 4 ) " " " "      " "      "*" " "  "*"     " "  " "  " "      
## 4  ( 1 ) " " " "      "*"      "*" "*"  "*"     " "  " "  " "      
## 4  ( 2 ) " " " "      " "      "*" "*"  "*"     " "  " "  " "      
## 4  ( 3 ) " " " "      " "      "*" "*"  "*"     " "  " "  " "      
## 4  ( 4 ) " " " "      " "      "*" "*"  "*"     " "  " "  " "      
## 5  ( 1 ) " " " "      "*"      "*" "*"  "*"     " "  " "  " "      
## 5  ( 2 ) " " " "      "*"      "*" "*"  "*"     " "  " "  " "      
## 5  ( 3 ) " " "*"      "*"      "*" "*"  "*"     " "  " "  " "      
## 5  ( 4 ) " " " "      "*"      "*" "*"  "*"     " "  " "  " "      
## 6  ( 1 ) " " " "      "*"      "*" "*"  "*"     " "  " "  " "      
## 6  ( 2 ) " " "*"      "*"      "*" "*"  "*"     " "  " "  " "      
## 6  ( 3 ) " " " "      "*"      "*" "*"  "*"     " "  " "  "*"      
## 6  ( 4 ) " " " "      "*"      "*" "*"  "*"     " "  " "  " "      
## 7  ( 1 ) " " "*"      "*"      "*" "*"  "*"     " "  " "  " "      
## 7  ( 2 ) " " " "      "*"      "*" "*"  "*"     " "  " "  " "      
## 7  ( 3 ) " " " "      "*"      "*" "*"  "*"     " "  " "  "*"      
## 7  ( 4 ) "*" " "      "*"      "*" "*"  "*"     " "  " "  " "      
## 8  ( 1 ) " " "*"      "*"      "*" "*"  "*"     " "  " "  "*"      
## 8  ( 2 ) " " "*"      "*"      "*" "*"  "*"     " "  " "  " "      
## 8  ( 3 ) "*" "*"      "*"      "*" "*"  "*"     " "  " "  " "      
## 8  ( 4 ) " " "*"      "*"      "*" "*"  "*"     " "  " "  " "      
##          rs744373 rs11136000 rs610932 rs3851179 rs3764650 rs3865444
## 1  ( 1 ) " "      " "        " "      " "       " "       " "      
## 1  ( 2 ) " "      " "        " "      " "       " "       " "      
## 1  ( 3 ) " "      " "        " "      " "       " "       " "      
## 1  ( 4 ) " "      " "        " "      " "       " "       " "      
## 2  ( 1 ) " "      " "        " "      " "       " "       " "      
## 2  ( 2 ) " "      " "        " "      " "       " "       " "      
## 2  ( 3 ) " "      " "        " "      " "       " "       " "      
## 2  ( 4 ) " "      " "        " "      " "       " "       " "      
## 3  ( 1 ) " "      " "        " "      " "       " "       " "      
## 3  ( 2 ) " "      " "        " "      " "       " "       " "      
## 3  ( 3 ) " "      " "        "*"      " "       " "       " "      
## 3  ( 4 ) " "      " "        " "      " "       " "       "*"      
## 4  ( 1 ) " "      " "        " "      " "       " "       " "      
## 4  ( 2 ) " "      " "        " "      " "       " "       "*"      
## 4  ( 3 ) " "      " "        "*"      " "       " "       " "      
## 4  ( 4 ) " "      " "        " "      " "       "*"       " "      
## 5  ( 1 ) " "      " "        "*"      " "       " "       " "      
## 5  ( 2 ) " "      " "        " "      " "       " "       "*"      
## 5  ( 3 ) " "      " "        " "      " "       " "       " "      
## 5  ( 4 ) " "      " "        " "      " "       "*"       " "      
## 6  ( 1 ) " "      " "        "*"      " "       " "       "*"      
## 6  ( 2 ) " "      " "        "*"      " "       " "       " "      
## 6  ( 3 ) " "      " "        "*"      " "       " "       " "      
## 6  ( 4 ) " "      " "        "*"      " "       "*"       " "      
## 7  ( 1 ) " "      " "        "*"      " "       " "       "*"      
## 7  ( 2 ) " "      " "        "*"      " "       "*"       "*"      
## 7  ( 3 ) " "      " "        "*"      " "       " "       "*"      
## 7  ( 4 ) " "      " "        "*"      " "       " "       "*"      
## 8  ( 1 ) " "      " "        "*"      " "       " "       "*"      
## 8  ( 2 ) " "      " "        "*"      " "       "*"       "*"      
## 8  ( 3 ) " "      " "        "*"      " "       " "       "*"      
## 8  ( 4 ) " "      "*"        "*"      " "       " "       "*"
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps,scale="r2")

The leaps output shows which models achieve the highest R-squared values and which variables appear most frequently in those models. From the plot, the highest R-squared is achieved by the model with 8 predictors: PTGENDER, PTEDUCAT, FDG, AV45, HippoNV, rs3818361, rs610932, and rs3865444. All 6 predictors used in our final model (PTEDUCAT, FDG, AV45, HippoNV, rs610932, rs3865444) are included among these 8.
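Beyond the R-squared plot, the numerical selection statistics can be pulled directly out of the regsubsets summary; a sketch (assuming the `leaps` object created above):

```r
# Sketch: extract selection statistics from the regsubsets object `leaps`.
leaps.summary <- summary(leaps)
best <- which.max(leaps.summary$adjr2)   # subset with the highest adjusted R-squared
leaps.summary$adjr2[best]
names(coef(leaps, best))                 # variables included in that subset
```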

Limitations of linear regression

Passing the significance tests and fitting the data well only mean that there is no significant evidence against the model; it does not establish a causal model, and other models could plausibly fit the data as well.

Assumptions underlying the estimation:

1. the estimates of the regression parameters are independent (their correlations are zero);
2. the variances of the regression parameter estimates are the same.
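These assumptions can be probed empirically; a sketch (assuming a fitted model such as `lm.AD.F` from earlier in the lab):

```r
# Sketch: inspect the covariance/correlation of the coefficient estimates.
vcov(lm.AD.F)            # covariance matrix of the parameter estimates
cov2cor(vcov(lm.AD.F))   # off-diagonal entries near 0 support the first assumption
```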

Interaction terms

Interactions between the predictors can provide additional useful insights. However, choosing which interaction terms to include is difficult; careful and thorough analysis is required before adding them to the model.
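As an illustration only (the choice of terms here is hypothetical, not a recommendation), an interaction can be added with the `:` operator:

```r
# Sketch: add one candidate interaction term and test its significance.
lm.inter <- lm(TOTAL13 ~ FDG + PTEDUCAT + FDG:PTEDUCAT, data = AD)
summary(lm.inter)   # check the p-value of the FDG:PTEDUCAT coefficient
```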

Scatter plots help visualize the relationship between any variable and the outcome, and can reveal how that relationship changes with other variables.

For continuous predictors:

library(ggplot2)
library(GGally)
p <- ggpairs(AD[,c(17,1,3,4,5,6)], upper = list(continuous = "points")
             , lower = list(continuous = "cor")
)
print(p)

For categorical predictors:

# Boxplot
library(ggplot2)
qplot(factor(PTGENDER), TOTAL13, data = AD, 
      geom=c("boxplot"), fill = factor(PTGENDER))

qplot(factor(rs3818361), TOTAL13, data = AD, 
      geom=c("boxplot"), fill = factor(rs3818361))

qplot(factor(rs11136000), TOTAL13, data = AD, 
      geom=c("boxplot"), fill = factor(rs11136000))

qplot(factor(rs744373), TOTAL13, data = AD, 
      geom=c("boxplot"), fill = factor(rs744373))

qplot(factor(rs610932), TOTAL13, data = AD, 
      geom=c("boxplot"), fill = factor(rs610932))

qplot(factor(rs3865444), TOTAL13, data = AD, 
      geom=c("boxplot"), fill = factor(rs3865444))

# Histogram
library(ggplot2)
qplot(TOTAL13, data = AD, geom = "histogram",
      fill = factor(PTGENDER))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(TOTAL13, data = AD, geom = "histogram",
      fill = factor(rs3818361))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(TOTAL13, data = AD, geom = "histogram",
      fill = factor(rs11136000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(TOTAL13, data = AD, geom = "histogram",
      fill = factor(rs744373))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(TOTAL13, data = AD, geom = "histogram",
      fill = factor(rs610932))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(TOTAL13, data = AD, geom = "histogram",
      fill = factor(rs3865444))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Decision tree

Using all variables:

library(rpart)
library(rpart.plot)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
library(ggplot2)
library(partykit)
## Loading required package: grid
theme_set(theme_gray(base_size = 15))
data <- AD_full

tree <- rpart(TOTAL13 ~ ., data, method="anova")
prp(tree, nn.cex = 1)

Because TOTAL13 is a continuous variable, the average of each pair of consecutive observed values is used as a candidate splitting value.

print(tree$variable.importance)
##        FDG    HippoNV       AV45        AGE 
## 10003.6912  4668.3064  2725.7887   968.0736

FDG has the largest importance score among all variables.

The complexity parameter cp controls model complexity: a split is kept only if it improves the relative error by at least cp, so a larger cp produces a less complex tree. The tree can be pruned with the prune() function. We used cp = 0.05 and then cp = 0.2 to compare the complexity of the resulting decision trees.

tree_0.05 <- prune(tree, cp = 0.05)
prp(tree_0.05, nn.cex = 1)

tree_0.2 <- prune(tree, cp = 0.2)
prp(tree_0.2, nn.cex = 1)

As cp value becomes larger, the size of decision tree becomes smaller.
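Instead of trying cp values by hand, rpart's built-in cross-validation results can also guide the choice; a sketch using the `tree` fitted above:

```r
# Sketch: cross-validated error for each candidate cp value.
printcp(tree)   # CP table; xerror is the cross-validated relative error
plotcp(tree)    # choose the cp minimizing xerror (or apply the 1-SE rule)
```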

Finding the adequate number of leaf nodes

To find an adequate tree size, we conducted further analysis on the model. We used half of the data points to train a decision tree and the other half for testing. Training and testing errors can be calculated for each tree structure, and the number of leaf nodes is used as the measure of the tree's complexity. (The errors below are exact-match error rates on floored predictions, a rough proxy given that TOTAL13 is continuous.)

set.seed(1)
train_sample <- sample(nrow(data),floor( nrow(data)/2) )
errintrain <- NULL
errintest <- NULL
leaf.v <- NULL

for(i in seq(0.2,0,by=-0.005) ){
  tree <- rpart( TOTAL13 ~ ., data = data[train_sample,], cp= i  ) 
  pred.train <- floor(predict(tree, data[train_sample,]))
  pred.test <- floor(predict(tree, data[-train_sample,]))
  current_error_train <- length(which(pred.train != data[train_sample,]$TOTAL13))/length(pred.train)
  current_error_test <- length(which(pred.test != data[-train_sample,]$TOTAL13))/length(pred.test)
  errintrain <- c(errintrain, current_error_train)
  errintest <- c(errintest, current_error_test)
  leaf.v <- c(leaf.v, length(which(tree$frame$var == "<leaf>")))
}
err.mat <- as.data.frame( cbind( train_err = errintrain, test_err = errintest , leaf_num = leaf.v ) )
err.mat$leaf_num <- as.factor( err.mat$leaf_num  )
err.mat <- unique(err.mat)
err.mat <- err.mat %>% gather(type, error, train_err,test_err)
print(err.mat)
##    leaf_num      type     error
## 1         2 train_err 0.9612403
## 2         3 train_err 0.9224806
## 3         4 train_err 0.9263566
## 4         5 train_err 0.9147287
## 5         6 train_err 0.9108527
## 6         8 train_err 0.9147287
## 7        18 train_err 0.9069767
## 8        21 train_err 0.8759690
## 9         2  test_err 0.9382239
## 10        3  test_err 0.9382239
## 11        4  test_err 0.9305019
## 12        5  test_err 0.9150579
## 13        6  test_err 0.9266409
## 14        8  test_err 0.9266409
## 15       18  test_err 0.9498069
## 16       21  test_err 0.9498069

The test and training errors for different numbers of leaf nodes are plotted below. Training errors generally decrease, while test errors first decrease, reaching their minimum at 5 leaf nodes, and then increase. Therefore, an adequate number of leaf nodes is 5; larger trees begin to overfit the data.

data.plot <- err.mat %>% mutate(type = type)
ggplot(data.plot, aes(x=leaf_num, y=error, shape = type, color=type)) + geom_line() +
  geom_point(size=5) 
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

The final decision tree model is therefore selected with 5 leaf nodes, obtained by pruning with cp = 0.02.

tree_0.020 <- prune(tree, cp = 0.02)
prp(tree_0.020, nn.cex = 1)

Comparison between regression model and tree model

The decision tree used 4 variables (FDG, HippoNV, AV45, AGE) across its 5 decision nodes, using FDG twice with different thresholds. Three of these variables (FDG, HippoNV, AV45) are shared with the final regression model, which uses 6 variables (PTEDUCAT, FDG, AV45, HippoNV, rs610932, rs3865444).

Mean square error (MSE) can be used for comparing the two models.

MSE of our regression model can be calculated as:

# install.packages("Metrics")
library(Metrics)

predictedvalue <- predict(lm.AD.F,data)
mse(data$TOTAL13,predictedvalue)
## [1] 36.17068
#double check the function mse() is working
mean((data$TOTAL13-predictedvalue)^2)
## [1] 36.17068

MSE of our decision tree model can be calculated as:

# install.packages("Metrics")
# library(Metrics)

predictedvalue <- predict(tree_0.020,data)
mse(data$TOTAL13,predictedvalue)
## [1] 37.1621
#double check the function mse() is working
mean((data$TOTAL13-predictedvalue)^2)
## [1] 37.1621

The MSE of the regression model is lower than the MSE of the tree model; therefore, we would choose the regression model over the tree model to minimize prediction error.
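Both MSEs above are computed on the full data used for fitting, so a held-out comparison is arguably fairer; a sketch (assuming `data` and `train_sample` from above, and reusing the final regression formula):

```r
# Sketch: refit both models on the training half, compare test-set MSE.
lm.train   <- lm(TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + rs3865444,
                 data = data[train_sample, ])
tree.train <- rpart(TOTAL13 ~ ., data = data[train_sample, ], method = "anova")
test       <- data[-train_sample, ]
mean((test$TOTAL13 - predict(lm.train,   test))^2)   # regression test MSE
mean((test$TOTAL13 - predict(tree.train, test))^2)   # tree test MSE
```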

Reference:

Course textbook; PennState Eberly College of Science, STAT 501 (https://onlinecourses.science.psu.edu/stat501); University of Virginia Library (http://data.library.virginia.edu/understanding-q-q-plots/)

Question 2

Find two data sets from the UCI data repository or R datasets. Conduct a detailed regression analysis for both datasets using both regression model and the tree model (for regression), e.g., for regression model, you may want to conduct model selection, model comparison, testing of the significance of the regression parameters, evaluation of the R-squared and significance of the model. Also comment on the application of your model on the context of the dataset you have selected.

Car Dataset

The first dataset we chose for modeling is the cu.summary dataset that contains automobile data taken from the April, 1990 issue of “Consumer Reports”. The dataset contains 117 observations and 5 features. We are trying to predict the price of the car. A table of feature descriptions is provided below.

Feature Name Description
Price a numeric vector giving the list price in US dollars of a standard model
Country Country of origin
Reliability an ordered factor with levels ‘Much worse’ < ‘worse’ < ‘average’ < ‘better’ < ‘Much better’
Mileage fuel consumption miles per US gallon
Type a factor with levels Compact Large Medium Small Sporty Van

Linear Regression

We fit a linear regression model with all of the features to predict the price of the car. Looking at the summary, we see that this linear regression model explains 83.7% of the variability in the data.

lm.car <- lm(Price~., data=df.car)
summary(lm.car)
## 
## Call:
## lm(formula = Price ~ ., data = df.car)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3078.1 -1048.3     3.8   920.1  5341.2 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             18992.2     3389.0   5.604 3.42e-06 ***
## CountryJapan             1213.8     2719.7   0.446  0.65838    
## CountryJapan/USA         -154.6     2756.3  -0.056  0.95562    
## CountryKorea            -2203.6     2962.6  -0.744  0.46243    
## CountryMexico           -3176.7     3298.4  -0.963  0.34272    
## CountrySweden            5558.9     3164.0   1.757  0.08850 .  
## CountryUSA              -2625.2     2566.9  -1.023  0.31411    
## Reliabilitybetter        1853.7     1460.1   1.270  0.21340    
## ReliabilityMuch better   -913.6     1452.3  -0.629  0.53377    
## ReliabilityMuch worse    -415.1     1267.3  -0.328  0.74538    
## Reliabilityworse         -822.7     1266.5  -0.650  0.52057    
## Mileage                  -265.3      129.7  -2.044  0.04921 *  
## TypeLarge                5140.8     1686.4   3.048  0.00459 ** 
## TypeMedium               5431.7     1070.7   5.073 1.61e-05 ***
## TypeSmall               -2100.3     1346.9  -1.559  0.12875    
## TypeSporty                894.6     1282.7   0.697  0.49057    
## TypeVan                  1723.1     1780.6   0.968  0.34045    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2091 on 32 degrees of freedom
##   (68 observations deleted due to missingness)
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.7555 
## F-statistic: 10.27 on 16 and 32 DF,  p-value: 1.899e-08

We use the stepAIC function to identify and discard insignificant features from our model. Looking at the output, none of the features were discarded: removing any feature group would increase the AIC, so the stepwise search selects our initial model as the best linear regression model for this data by that criterion.

Looking at the summary, the levels that increase the price include: a larger vehicle size (medium or large), better reliability, origin in Japan or Sweden, and the sporty and van body types. Mileage, the only continuous feature, has a negative coefficient, and the remaining factor levels are negative as well.

lm.car <- stepAIC(lm.car, direction="both")
## Start:  AIC=762.39
## Price ~ Country + Reliability + Mileage + Type
## 
##               Df Sum of Sq       RSS    AIC
## <none>                     139957482 762.39
## - Reliability  4  27132444 167089926 763.07
## - Mileage      1  18280336 158237818 766.40
## - Country      6  80914089 220871570 772.74
## - Type         5 168527460 308484942 791.11
summary(lm.car)
## 
## Call:
## lm(formula = Price ~ Country + Reliability + Mileage + Type, 
##     data = df.car)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3078.1 -1048.3     3.8   920.1  5341.2 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             18992.2     3389.0   5.604 3.42e-06 ***
## CountryJapan             1213.8     2719.7   0.446  0.65838    
## CountryJapan/USA         -154.6     2756.3  -0.056  0.95562    
## CountryKorea            -2203.6     2962.6  -0.744  0.46243    
## CountryMexico           -3176.7     3298.4  -0.963  0.34272    
## CountrySweden            5558.9     3164.0   1.757  0.08850 .  
## CountryUSA              -2625.2     2566.9  -1.023  0.31411    
## Reliabilitybetter        1853.7     1460.1   1.270  0.21340    
## ReliabilityMuch better   -913.6     1452.3  -0.629  0.53377    
## ReliabilityMuch worse    -415.1     1267.3  -0.328  0.74538    
## Reliabilityworse         -822.7     1266.5  -0.650  0.52057    
## Mileage                  -265.3      129.7  -2.044  0.04921 *  
## TypeLarge                5140.8     1686.4   3.048  0.00459 ** 
## TypeMedium               5431.7     1070.7   5.073 1.61e-05 ***
## TypeSmall               -2100.3     1346.9  -1.559  0.12875    
## TypeSporty                894.6     1282.7   0.697  0.49057    
## TypeVan                  1723.1     1780.6   0.968  0.34045    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2091 on 32 degrees of freedom
##   (68 observations deleted due to missingness)
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.7555 
## F-statistic: 10.27 on 16 and 32 DF,  p-value: 1.899e-08

We now analyze our final linear regression model using the residual vs fitted plot and the QQ-plot. The residual vs fitted plot shows randomly scattered points, suggesting no obvious violation of the regression assumptions. The QQ-plot shows points clustered along the line, indicating that the residuals can plausibly be treated as normally distributed.

par(mar = rep(2, 4)) # Change the margins
plot(lm.car, which=c(1))

plot(lm.car, which=c(2))
## Warning: not plotting observations with leverage one:
##   6, 12, 32

This model can be used to study which factors have the biggest impact on increasing the price of a vehicle. Understanding the extent in which certain features inflate the price of a vehicle can help manufacturers decide which areas they should focus on when attempting to decrease their prices.
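As a usage sketch, the fitted model can price a hypothetical vehicle (the feature values here are made up purely for illustration):

```r
# Sketch: predict the list price of a hypothetical new car.
new.car <- data.frame(Country = "Japan", Reliability = "average",
                      Mileage = 30, Type = "Small")
predict(lm.car, newdata = new.car, interval = "prediction")
```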

Decision Tree

A model of the decision tree fitted to our dataset is produced below. The decision tree only incorporates two out of the four features; the tree incorporates the type and country of origin of each vehicle while it discards the information pertaining to the reliability and mileage of the vehicle. In contrast, the linear regression model incorporated all four of the given features and showed that including three of the features would always positively affect the price. This discrepancy between models suggests that each model may be able to better explain certain characteristics of the data.

library("rpart.plot")

tr.car <- rpart(Price~., data=df.car)
prp(tr.car, varlen=3)

Males Dataset

The second dataset we chose for modeling is the Males dataset that recorded data about the wages and education of young males. The dataset contained 4360 observations and 12 features. We are trying to predict the wage of the person. A table of feature descriptions is provided below.

Feature Name Description
nr unique identifier
year year data was collected
school years of schooling
exper years of experience
union wage set by collective bargaining?
ethn ethnicity
married are they married?
health health problems?
wage log of hourly wage
industry work industry
occupation job occupation
residence area of residence

Linear Regression

We first fit an initial linear regression model with all of the features to predict the wage of the individual. From looking at the summary, the initial linear regression model is able to explain 28.54% of the variability of the wage.

lm.males <- lm(wage~., data=df.males)

Next we performed feature selection using stepwise regression to fit a better model. The final model from the stepwise regression retained all features except health. Comparing the two models, the R-squared was only reduced to 28.53%. Most of the retained coefficients are significant at the 0.05 level. Looking at the coefficients, years of schooling, years of experience, union membership, ethnicity, and being married all have a positive impact on wages per unit increase, holding the others constant. The other coefficients are a mix of positive and negative values depending on the specific categorical level.

lm.males <- stepAIC(lm.males, direction="both")
## Start:  AIC=-4910.9
## wage ~ nr + year + school + exper + union + ethn + married + 
##     health + industry + occupation + residence
## 
##              Df Sum of Sq    RSS     AIC
## - health      1     0.141 630.89 -4912.2
## <none>                    630.75 -4910.9
## - nr          1     1.394 632.14 -4906.0
## - exper       1     1.405 632.15 -4906.0
## - year        1     3.152 633.90 -4897.4
## - ethn        2     3.879 634.63 -4895.8
## - married     1     4.809 635.56 -4889.2
## - residence   3     9.557 640.30 -4870.1
## - occupation  8    15.155 645.90 -4852.9
## - union       1    15.827 646.57 -4835.7
## - school      1    28.968 659.71 -4773.0
## - industry   11    50.919 681.67 -4691.1
## 
## Step:  AIC=-4912.21
## wage ~ nr + year + school + exper + union + ethn + married + 
##     industry + occupation + residence
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    630.89 -4912.2
## + health      1     0.141 630.75 -4910.9
## - nr          1     1.389 632.28 -4907.4
## - exper       1     1.417 632.30 -4907.2
## - year        1     3.155 634.04 -4898.7
## - ethn        2     3.862 634.75 -4897.2
## - married     1     4.785 635.67 -4890.7
## - residence   3     9.567 640.45 -4871.3
## - occupation  8    15.338 646.23 -4853.4
## - union       1    16.019 646.91 -4836.1
## - school      1    29.104 659.99 -4773.7
## - industry   11    50.937 681.82 -4692.3
summary(lm.males)
## 
## Call:
## lm(formula = wage ~ nr + year + school + exper + union + ethn + 
##     married + industry + occupation + residence, data = df.males)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1925 -0.2273  0.0224  0.2568  2.2815 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                   -5.836e+01  1.487e+01
## nr                                            -6.937e-06  2.662e-06
## year                                           2.957e-02  7.531e-03
## school                                         7.621e-02  6.389e-03
## exper                                          1.754e-02  6.666e-03
## unionyes                                       1.794e-01  2.028e-02
## ethnhisp                                       9.749e-02  3.459e-02
## ethnother                                      1.133e-01  2.620e-02
## marriedyes                                     8.652e-02  1.789e-02
## industryBusiness_and_Repair_Service            2.349e-01  6.760e-02
## industryConstruction                           3.316e-01  6.837e-02
## industryEntertainment                         -1.451e-01  9.022e-02
## industryFinance                                4.822e-01  7.558e-02
## industryManufacturing                          3.814e-01  6.245e-02
## industryMining                                 4.450e-01  9.605e-02
## industryPersonal_Service                       1.984e-01  8.611e-02
## industryProfessional_and_Related Service       6.897e-02  6.810e-02
## industryPublic_Administration                  2.905e-01  7.309e-02
## industryTrade                                  1.485e-01  6.279e-02
## industryTransportation                         4.458e-01  6.739e-02
## occupationCraftsmen, Foremen_and_kindred       5.734e-02  3.162e-02
## occupationFarm_Laborers_and_Foreman            1.046e-02  9.341e-02
## occupationLaborers_and_farmers                -3.476e-02  3.710e-02
## occupationManagers, Officials_and_Proprietors  1.458e-01  3.611e-02
## occupationOperatives_and_kindred              -3.779e-02  3.229e-02
## occupationProfessional, Technical_and_kindred  1.732e-01  3.580e-02
## occupationSales_Workers                        6.662e-02  4.391e-02
## occupationService_Workers                     -5.474e-02  3.419e-02
## residencenothern_central                      -1.457e-01  2.285e-02
## residencerural_area                           -1.034e-01  5.429e-02
## residencesouth                                -1.242e-01  2.172e-02
##                                               t value Pr(>|t|)    
## (Intercept)                                    -3.923 8.92e-05 ***
## nr                                             -2.606 0.009199 ** 
## year                                            3.927 8.79e-05 ***
## school                                         11.928  < 2e-16 ***
## exper                                           2.632 0.008537 ** 
## unionyes                                        8.849  < 2e-16 ***
## ethnhisp                                        2.819 0.004855 ** 
## ethnother                                       4.325 1.57e-05 ***
## marriedyes                                      4.837 1.39e-06 ***
## industryBusiness_and_Repair_Service             3.475 0.000518 ***
## industryConstruction                            4.849 1.30e-06 ***
## industryEntertainment                          -1.609 0.107776    
## industryFinance                                 6.381 2.03e-10 ***
## industryManufacturing                           6.108 1.14e-09 ***
## industryMining                                  4.633 3.76e-06 ***
## industryPersonal_Service                        2.304 0.021311 *  
## industryProfessional_and_Related Service        1.013 0.311247    
## industryPublic_Administration                   3.975 7.20e-05 ***
## industryTrade                                   2.365 0.018088 *  
## industryTransportation                          6.615 4.37e-11 ***
## occupationCraftsmen, Foremen_and_kindred        1.813 0.069878 .  
## occupationFarm_Laborers_and_Foreman             0.112 0.910864    
## occupationLaborers_and_farmers                 -0.937 0.348912    
## occupationManagers, Officials_and_Proprietors   4.038 5.53e-05 ***
## occupationOperatives_and_kindred               -1.170 0.242006    
## occupationProfessional, Technical_and_kindred   4.839 1.37e-06 ***
## occupationSales_Workers                         1.517 0.129320    
## occupationService_Workers                      -1.601 0.109440    
## residencenothern_central                       -6.375 2.10e-10 ***
## residencerural_area                            -1.905 0.056834 .  
## residencesouth                                 -5.719 1.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4523 on 3084 degrees of freedom
##   (1245 observations deleted due to missingness)
## Multiple R-squared:  0.2853, Adjusted R-squared:  0.2783 
## F-statistic: 41.03 on 30 and 3084 DF,  p-value: < 2.2e-16

After obtaining this final model, we analyze the residual vs fitted plot and the QQ-plot to check whether any linear regression assumptions are violated. In the residual vs fitted plot, the points appear randomly distributed with no trend as the fitted values increase, so there is no apparent violation. The QQ-plot, however, shows some deviation from the normality assumption in the tails.

plot(lm.males, which=c(1))

plot(lm.males, which=c(2))

This model could be used to study which features are most strongly correlated with a high wage. This could help inform policy decisions about whether more education increases wages, and help study discrepancies among different socioeconomic factors.

Decision Tree

A decision tree was fit to the data and its output is shown below. Compared to the regression model, the decision tree incorporated only 5 of the 11 features: industry, years of schooling, year of data collection, marital status, and years of work experience. From the rules the decision tree produced, the observations with the highest wages are those with 12+ years of schooling who are not in the agriculture, construction, entertainment, personal services, professional services, or transportation industries. This type of model can reveal underlying patterns in the data that a linear regression might not be able to ascertain, and can be used to predict wages of young professional males.

tr.males <- rpart(wage~.,data=df.males)
prp(tr.males, nn.cex=1)
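The rules described above can also be printed as text with rpart.plot's rpart.rules(), which is often easier to read than the plot:

```r
# Sketch: print the tree's decision rules in plain text.
rpart.rules(tr.males, cover = TRUE)   # cover = TRUE also shows the share of cases per rule
```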

Question 5

Build a decision tree model based on the following dataset. Don’t use R. Use your pen and paper, and show the process.

df.q5 <- data.frame(x1=c(0.22,0.58,0.57,0.41,0.60,0.12,0.25,0.32), 
                    x2=c(0.38,0.32,0.28,0.43,0.29,0.32,0.32,0.38), 
                    y=factor(c("No","Yes","Yes","Yes","No","Yes","Yes","No")))

We chose an arbitrary rule of splitting at the quartiles. The first split is on x2 at the 3rd quartile, which gives \(x2 \leq 0.38\). The left node contains data points (1,2,3,5,6,7,8) and the right node contains data point (4).

\[e_{root} = -\frac{5}{8} log_{2}\frac{5}{8} - \frac{3}{8} log_{2}\frac{3}{8} = 0.9544\] \[e_{x2 \leq 0.38} = -\frac{4}{7} log_{2}\frac{4}{7} - \frac{3}{7} log_{2}\frac{3}{7} = 0.9852\] \[e_{x2 \gt 0.38} = -\frac{1}{1} log_{2}\frac{1}{1} - \frac{0}{1} log_{2}\frac{0}{1} = 0\] \[IG = e_{root} - \frac{7}{8}*e_{x2 \leq 0.38}-\frac{1}{8}*e_{x2 \gt 0.38} = 0.0924\]

The second split is on the left node containing data points (1,2,3,5,6,7,8), on x2 at its 3rd quartile, which gives \(x2 \leq 0.35\). The left node contains data points (2,3,5,6,7) and the right node contains data points (1,8).

\[e_{root} = -\frac{4}{7} log_{2}\frac{4}{7} - \frac{3}{7} log_{2}\frac{3}{7} = 0.9852\] \[e_{x2 \leq 0.35} = -\frac{4}{5} log_{2}\frac{4}{5} - \frac{1}{5} log_{2}\frac{1}{5} = 0.7219\] \[e_{x2 \gt 0.35} = -\frac{0}{2} log_{2}\frac{0}{2} - \frac{2}{2} log_{2}\frac{2}{2} = 0\] \[IG = e_{root} - \frac{5}{7}*e_{x2 \leq 0.35}-\frac{2}{7}*e_{x2 \gt 0.35} = 0.4696\]

The third split is on the left node containing data points (2,3,5,6,7), on x1 at its 3rd quartile, which gives \(x1 \leq 0.58\). The left node contains data points (2,3,6,7) and the right node contains data point (5).

\[e_{root} = -\frac{4}{5} log_{2}\frac{4}{5} - \frac{1}{5} log_{2}\frac{1}{5} = 0.7219\] \[e_{x1 \leq 0.58} = -\frac{4}{4} log_{2}\frac{4}{4} - \frac{0}{4} log_{2}\frac{0}{4} = 0\] \[e_{x1 \gt 0.58} = -\frac{0}{1} log_{2}\frac{0}{1} - \frac{1}{1} log_{2}\frac{1}{1} = 0\] \[IG = e_{root} - \frac{4}{5}*e_{x1 \leq 0.58}-\frac{1}{5}*e_{x1 \gt 0.58} = 0.7219\]
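The by-hand entropy and information-gain calculations above can be double-checked in R; a small sketch (the `entropy` helper is our own, not a library function):

```r
# Entropy of a node given class counts, with 0*log2(0) taken as 0.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}
# Split 1: root (5 Yes, 3 No) -> left (4 Yes, 3 No), right (1 Yes, 0 No)
ig1 <- entropy(c(5, 3)) - (7/8) * entropy(c(4, 3)) - (1/8) * entropy(c(1, 0))
# Split 2: (4 Yes, 3 No) -> left (4 Yes, 1 No), right (0 Yes, 2 No)
ig2 <- entropy(c(4, 3)) - (5/7) * entropy(c(4, 1)) - (2/7) * entropy(c(0, 2))
# Split 3: (4 Yes, 1 No) -> left (4 Yes, 0 No), right (0 Yes, 1 No)
ig3 <- entropy(c(4, 1)) - (4/5) * entropy(c(4, 0)) - (1/5) * entropy(c(0, 1))
round(c(ig1, ig2, ig3), 4)   # 0.0924 0.4696 0.7219
```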

Question 6

Write your own R script to implement the least squares estimation of a regression model. Compare the output from your script with the output from lm().

We arbitrarily choose the number of input variables as p = 3, and define the following function.

#least_squares: Returns the least squares estimation of a regression model
least_squares <- function(x1, x2, x3, y){ 
  
  # Creating the X and Y matrices
  x_matrix = as.matrix(cbind(rep(1, length(y)), x1, x2, x3))  # intercept column plus predictors
  y_matrix = as.matrix(y)
  
  # Computing the beta estimator
  beta_estimator = solve( t(x_matrix) %*% x_matrix ) %*% t(x_matrix) %*% y_matrix
  
  return(beta_estimator)
} 

We can now compare our function against the output from lm()

# Choose random x1, x2, x3 inputs
x1 = runif(20)
x2 = runif(20)
x3 = runif(20)

# Create an arbitrary y 
y = 4 + 2*x1 + 1*x2 + 8*x3

# Call our function: least_squares 
least_squares(x1, x2, x3, y)
##    [,1]
##       4
## x1    2
## x2    1
## x3    8
# Calling the R function for linear regression: lm 
lm(y~., data = data.frame(x1, x2, x3, y))
## 
## Call:
## lm(formula = y ~ ., data = data.frame(x1, x2, x3, y))
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##           4            2            1            8
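One numerical caveat: solving the normal equations via solve(t(X) %*% X) can be ill-conditioned, which is why lm() works through a QR decomposition instead. A self-contained sketch of the QR route (seeded so the random draw is reproducible):

```r
# Sketch: least squares via QR, which avoids forming t(X) %*% X explicitly.
set.seed(42)
x1 <- runif(20); x2 <- runif(20); x3 <- runif(20)
y  <- 4 + 2*x1 + 1*x2 + 8*x3         # noiseless, so the fit recovers the coefficients exactly
X  <- cbind(1, x1, x2, x3)
beta_qr <- qr.solve(X, y)            # least squares solution via QR
round(beta_qr, 6)                    # 4 2 1 8
```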